In previous sessions, we've talked about the underlying data structures in the Pandas library. We've seen how to manipulate DataFrame and Series objects in order to answer questions regarding the data. Last week, we also saw how to use matplotlib to visualize our analysis.

This week we will be diving into an important topic in Pandas: aggregating and performing operations on specified groups of data without modifying the underlying structure. According to Wes McKinney,

Categorizing a data set and applying a function to each group, whether an aggregation or transformation, is often a critical component of a data analysis workflow. After loading, merging, and preparing a data set, a familiar task is to compute group statistics or possibly pivot tables for reporting or visualization purposes. Pandas provides a flexible and high-performance groupby facility, enabling you to slice and dice, and summarize data sets in a natural way.


In [ ]:
from __future__ import division
from numpy.random import randn
import numpy as np
import os
import matplotlib.pyplot as plt
np.random.seed(12345)
plt.rc('figure', figsize=(10, 6))
from pandas import Series, DataFrame
import pandas as pd
np.set_printoptions(precision=4)

Data Aggregation and Group Operations


In [ ]:
pd.options.display.notebook_repr_html = False
%matplotlib inline

GroupBy mechanics

Pandas' groupby machinery draws heavily on data aggregation techniques developed for the R programming language. The main mechanism is the split-apply-combine paradigm (a toy sketch in code follows the list below):

  1. Data is split into groups based on one or more provided keys,
  2. A function is applied to each group,
  3. The results of all the function applications are combined into a result object.
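
To make these steps concrete, here is a toy, hand-rolled version of the process on a made-up Series and key array; as we will see, groupby performs all three steps for you in a single call.


In [ ]:
# Toy illustration of split-apply-combine (groupby does this in one step)
vals = Series([1.0, 2.0, 3.0, 4.0], index=['w', 'x', 'y', 'z'])
keys = Series(['a', 'b', 'a', 'b'], index=vals.index)

pieces = {k: vals[keys == k] for k in keys.unique()}        # 1. split
applied = {k: piece.mean() for k, piece in pieces.items()}  # 2. apply
Series(applied)                                             # 3. combine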

As we will see, grouping keys are very flexible in nature. Some possible types of keys are

  • A list or array of values sharing the length of the grouped column
  • A value indicating a column name
  • A dict or Series giving a correspondence between the values on the axis being grouped and the group names
  • A function to be invoked on the axis index or the individual labels in the index

In [ ]:
df = DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],
                'key2' : ['one', 'two', 'one', 'two', 'one'],
                'data1' : np.random.randn(5),
                'data2' : np.random.randn(5)})
df

We can compute the mean of the column corresponding to "data1" by using the group labels from "key1". This can be done in a number of ways, but here is a straightforward example:


In [ ]:
grouped = df['data1'].groupby(df['key1'])
grouped

Notice that grouped is a GroupBy object. It hasn't actually computed anything yet except for some intermediate data about the group key. Now we can compute simple statistics just as we would with other objects.


In [ ]:
grouped.mean()

We can also pass an array of keys:


In [ ]:
means = df['data1'].groupby([df['key1'], df['key2']]).mean()
means

This forms a grouping with a hierarchical index, as we have seen earlier. To flatten the hierarchical index, we can call unstack().


In [ ]:
means.unstack()

In these examples, the keys refer to Series, though they could really be anything so long as the lengths match up. For example, consider the following key arrays.


In [ ]:
states = np.array(['Ohio', 'California', 'California', 'Ohio', 'Ohio'])
years = np.array([2005, 2005, 2006, 2005, 2006])
df['data1'].groupby([states, years]).mean()

If the grouping information is already a column of the same DataFrame, you can simply pass the column name, or a list of names, as strings:


In [ ]:
df.groupby('key1').mean()

In [ ]:
df.groupby(['key1', 'key2']).mean()

In [ ]:
df.groupby(['key1', 'key2']).size()

Iterating over groups

groupby() supports iteration. Specifically, iterating over a GroupBy object produces 2-tuples containing the group name along with the corresponding chunk of data. For example:


In [ ]:
for name, group in df.groupby('key1'):
    print(name)
    print(group)

If multiple keys are passed, the first element of each tuple is itself a tuple of key values. For example:


In [ ]:
for (k1, k2), group in df.groupby(['key1', 'key2']):
    print((k1, k2))
    print(group)

This process is quite flexible. For example, suppose you want to store the group DataFrames in a native Python dict. We can do this by wrapping the groupby() result in list() and passing it to dict():


In [ ]:
pieces = dict(list(df.groupby('key1')))
pieces['b']

By default, groupby groups on axis=0, which splits the rows according to the index. You can instead group the columns by passing axis=1; for example, we can group the columns of df by their dtype:


In [ ]:
df.dtypes

In [ ]:
grouped = df.groupby(df.dtypes, axis=1)
dict(list(grouped))

Selecting a column or subset of columns

Indexing a GroupBy object created from a DataFrame with a column name or list of column names has the effect of selecting those columns for aggregation. This means that:

df.groupby('key1')['data1']
df.groupby('key1')[['data2']]

is effectively identical to

df['data1'].groupby(df['key1'])
df[['data2']].groupby(df['key1'])

Why is this useful? If you're working with a large dataset for which aggregating the entire DataFrame is out of the question, you can speed up the process by specifying the particular columns you are interested in.

Here's an example: we can group a DataFrame according to a set of keys, specify a particular column (in this case, data2), and take the resulting mean.


In [ ]:
df.groupby(['key1', 'key2'])[['data2']].mean()

The object returned here is a grouped DataFrame if a list or array is passed and a grouped Series if just a column name is passed.


In [ ]:
s_grouped = df.groupby(['key1', 'key2'])['data2']
s_grouped

In [ ]:
s_grouped.mean()

Grouping with dicts and Series

Grouping information may exist in a form other than an array. Let's consider another example DataFrame.


In [ ]:
people = DataFrame(np.random.randn(5, 5),
                   columns=['a', 'b', 'c', 'd', 'e'],
                   index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
people.loc['Wes', ['b', 'c']] = np.nan # Add a few NA values
people

Suppose we have a group correspondence for the columns and want to sum together the columns by group.


In [ ]:
mapping = {'a': 'red', 'b': 'red', 'c': 'blue',
           'd': 'blue', 'e': 'red', 'f' : 'orange'}

The groupby method can use the dict natively, allowing you to form GroupBy objects on the fly.


In [ ]:
by_column = people.groupby(mapping, axis=1)
by_column.sum()

The same holds for Series objects, which are structurally similar to dicts.


In [ ]:
map_series = Series(mapping)
map_series

In [ ]:
people.groupby(map_series, axis=1).count()

Grouping with functions

In Python, functions are first-class objects, so they can also serve as group keys. A function passed as a key is called once per value in the index, and its return values are used as the group names. In the example below, grouping by len groups the people by the length of their names:


In [ ]:
people.groupby(len).sum()

You can mix and match functions with other data types that we have seen above.


In [ ]:
key_list = ['one', 'one', 'one', 'two', 'two']
people.groupby([len, key_list]).min()

Grouping by index levels

Finally, you can use hierarchical indexes to perform groupby operations. To do this, pass the level number or name using the level keyword.


In [ ]:
columns = pd.MultiIndex.from_arrays([['US', 'US', 'US', 'JP', 'JP'],
                                    [1, 3, 5, 1, 3]], 
                                    names=['cty', 'tenor'])
hier_df = DataFrame(np.random.randn(4, 5), columns=columns)
hier_df

In [ ]:
hier_df.groupby(level='cty', axis=1).count()

Data aggregation

Wes McKinney defines data aggregation as any transformation that takes a dataset or other array and produces scalar values. For example, the simple statistical functions such as

  • mean
  • max
  • min
  • sum

are examples of operations taking arrays to numbers. Many of the aggregations that we have seen so far have been optimized for performance, but Pandas gives you the functionality to implement customized aggregators.


In [ ]:
df

For example, we can use quantile(q), which computes the $q$ quantile of a Series (e.g., 0.9 for the 90th percentile). Although quantile is not explicitly implemented for GroupBy, it is a Series method, so GroupBy applies it to each group and assembles the results:


In [ ]:
grouped = df.groupby('key1')
grouped['data1'].quantile(0.9)

More to the point, you can define your own functions and pass them to agg. For example, if you are interested in the range (max minus min) of each group, you can use:


In [ ]:
def peak_to_peak(arr):
    return arr.max() - arr.min()
grouped.agg(peak_to_peak)

Even methods that aren't really aggregations, such as describe, still perform useful operations on GroupBy objects.


In [ ]:
grouped.describe()

We will continue with the tips dataset from previous weeks to show off some of the more advanced features of aggregation. The data set can be found on the course webpage, or here, if you're lazy (like us).


In [ ]:
tips = pd.read_csv('tips.csv')
# Add tip percentage of total bill
tips['tip_pct'] = tips['tip'] / tips['total_bill']
tips[:6]

Column-wise and multiple function application

As we've seen, aggregating a Series or all of the columns of a DataFrame is a matter of using aggregate with the desired function or calling a method like mean or std. However, you may want to aggregate using a different function depending on the column, or apply multiple functions at once. Fortunately, this is straightforward to do, which we will illustrate through a number of examples. First, let's group the tips by sex and smoker.


In [ ]:
grouped = tips.groupby(['sex', 'smoker'])

Descriptive statistics, such as mean, can be passed to the aggregator as a string.


In [ ]:
grouped_pct = grouped['tip_pct']
grouped_pct.agg('mean')

You can also pass a list of functions to do the aggregation. Built-in aggregations can be referred to by name as strings; otherwise, simply pass the function object itself.


In [ ]:
grouped_pct.agg(['mean', 'std', peak_to_peak])

To control the column names assigned by agg, pass a (name, function) tuple for each aggregation; the first element becomes the column label.


In [ ]:
grouped_pct.agg([('foo', 'mean'), ('bar', np.std)])

With a DataFrame you have more options: you can specify a list of functions to apply to all of the columns, or different functions per column.


In [ ]:
functions = ['count', 'mean', 'max']
result = grouped[['tip_pct', 'total_bill']].agg(functions)
result

Here we are using what effectively amounts to a hierarchical index, which we can then slice by choosing columns and subcolumns.


In [ ]:
result['tip_pct']

As above, a list of tuples with custom names can be passed:


In [ ]:
ftuples = [('Durchschnitt', 'mean'), ('Abweichung', np.var)]
grouped[['tip_pct', 'total_bill']].agg(ftuples)

You can also apply different aggregations to different columns by passing a dict mapping column names to functions. For example:


In [ ]:
grouped.agg({'tip' : np.max, 'size' : 'sum'})

A column can receive several aggregations by mapping its dict key to a list of functions rather than a single one.


In [ ]:
grouped.agg({'tip_pct' : ['min', 'max', 'mean', 'std'],
             'size' : 'sum'})

Returning aggregated data in "unindexed" form

Of course, you can disable the hierarchical group index and return the group keys as ordinary columns by specifying the as_index=False optional parameter.


In [ ]:
tips.groupby(['sex', 'smoker'], as_index=False).mean()
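
The same flat layout can also be obtained by aggregating with the default hierarchical index and then calling reset_index on the result; which of the two you use is mostly a matter of taste.


In [ ]:
# Equivalent result via reset_index on the hierarchically indexed output
tips.groupby(['sex', 'smoker']).mean().reset_index()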

Group-wise operations and transformations

Aggregation is but one kind of group operation. It is a special case of more general data transformations, taking one-dimensional arrays and reducing them to scalars. Here we will generalize this notion by showing you how to use the transform and apply methods on GroupBy objects. Let's revisit our old DataFrame of random data.


In [ ]:
df

Suppose we want to add a column containing group means for each index. One way to do this is to aggregate, then merge:


In [ ]:
k1_means = df.groupby('key1').mean().add_prefix('mean_')
k1_means

In [ ]:
pd.merge(df, k1_means, left_on='key1', right_index=True)

This works but is somewhat inflexible. You can think of the operation as transforming the two data columns using the np.mean function. Returning to the people DataFrame from before, we can use the transform method on GroupBy.


In [ ]:
key = ['one', 'two', 'one', 'two', 'one']
people.groupby(key).mean()

In [ ]:
people.groupby(key).transform(np.mean)

What is going on here is that transform applies a function to each group, then places the results in the appropriate locations. When the function reduces each group to a scalar value, that value is simply broadcast across all the rows of the group.

Suppose instead you want to subtract the group mean from each value. To do so, let's define a function demean and pass it to transform:


In [ ]:
def demean(arr):
    return arr - arr.mean()
demeaned = people.groupby(key).transform(demean)
demeaned

You can check that demeaned now has zero group means:


In [ ]:
demeaned.groupby(key).mean()

Demeaning can also be achieved with apply, which we take up in detail in the next section.
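
As a quick preview, here is a minimal sketch of the same demeaning done with apply; passing group_keys=False keeps the result indexed like the original rows rather than prepending the group labels (which some pandas versions do by default).


In [ ]:
# Same demeaning as the transform example above, but via apply
people.groupby(key, group_keys=False).apply(demean)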

Apply: General split-apply-combine

There are three group-wise transformation tools that we can use to build analyses on our data: aggregate, transform, and apply. The first two are somewhat rigid in what they accept and return, which in exchange keeps them simple to use. The third, apply, gives you immense flexibility at the expense of some of that simplicity.

Returning to the tips.csv data set, suppose we want to select the top five tip_pct values by group. We can write a function to identify the top values of a DataFrame very easily:


In [ ]:
def top(df, n=5, column='tip_pct'):
    return df.sort_values(by=column)[-n:]
top(tips, n=6)

Now if we group by smoker, say, and call apply with this function, we get


In [ ]:
tips.groupby('smoker').apply(top)

top is called on each piece of the DataFrame, then the results are glued together using pandas.concat, labeling the pieces with the group names.
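
As a rough sketch of that equivalence (not literally what pandas does internally), we could build the same result by hand:


In [ ]:
# Roughly what apply does: call top on each group, then glue the pieces
# back together with concat, keyed by the group names
pieces = {name: top(group) for name, group in tips.groupby('smoker')}
pd.concat(pieces)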

If you pass a function to apply that takes other arguments or keywords, you can pass these after the function:


In [ ]:
tips.groupby(['smoker', 'day']).apply(top, n=1, column='total_bill')

Recall that describe works on GroupBy objects as well.


In [ ]:
result = tips.groupby('smoker')['tip_pct'].describe()
result

In [ ]:
result.unstack('smoker')

What's really happening (for all you 151ers) is that when you invoke a method like describe, it is actually just a shortcut for:

f = lambda x: x.describe()
grouped.apply(f)

Suppressing the group keys

One point of style: if you prefer that the group keys not be added to the result's index, pass group_keys=False to groupby.


In [ ]:
tips.groupby('smoker', group_keys=False).apply(top)

Example: Filling missing values with group-specific values

When cleaning up missing data, in some cases you will filter out data observations using dropna, but in others you may want to impute (fill in) the NA values using a fixed value or some value derived from the data. fillna is the right tool to use; for example, we can fill in NA values with the mean:


In [ ]:
s = Series(np.random.randn(6))
s[::2] = np.nan
s

In [ ]:
s.fillna(s.mean())

Suppose you need the fill value to vary by group. As you may guess, you need only group the data and use apply with a function that calls fillna on each data chunk. Here is some sample data on US states divided into eastern and western groups.


In [ ]:
states = ['Ohio', 'New York', 'Vermont', 'Florida',
          'Oregon', 'Nevada', 'California', 'Idaho']
group_key = ['East'] * 4 + ['West'] * 4
data = Series(np.random.randn(8), index=states)
data[['Vermont', 'Nevada', 'Idaho']] = np.nan
data

In [ ]:
data.groupby(group_key).mean()

We can fill the NA values using the group means:


In [ ]:
fill_mean = lambda g: g.fillna(g.mean())
data.groupby(group_key).apply(fill_mean)

In another case, you might have predefined fill values in your code that vary by group. Since each group has a name attribute set internally, we can use that:


In [ ]:
fill_values = {'East': 0.5, 'West': -1}
fill_func = lambda g: g.fillna(fill_values[g.name])

data.groupby(group_key).apply(fill_func)

Example: Random sampling and permutation

Suppose you wanted to draw a random sample from a large dataset for Monte Carlo simulation purposes or some other application. There are a number of ways to perform the "draws"; some are much more efficient than others. One way is to select the first K elements of np.random.permutation(N), where N is the size of your complete dataset and K the desired sample size.
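
Here is a minimal sketch of that idea; the Series and sample size below are made up purely for illustration.


In [ ]:
# Sketch: sample K rows by taking the first K positions of a random
# permutation of 0..N-1 (toy data, for illustration only)
N, K = 1000, 5
big = Series(np.random.randn(N))
sampler = np.random.permutation(N)[:K]
big.take(sampler)

As a more fun example, here's a way to construct a deck of English-style playing cards: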


In [ ]:
# Hearts, Spades, Clubs, Diamonds
suits = ['H', 'S', 'C', 'D']
card_val = (list(range(1, 11)) + [10] * 3) * 4
base_names = ['A'] + list(range(2, 11)) + ['J', 'K', 'Q']
cards = []
for suit in suits:
    cards.extend(str(num) + suit for num in base_names)

deck = Series(card_val, index=cards)

So now we have a Series of length 52 whose index contains the card names and whose values are the ones used in blackjack and other games (to keep things simple, we just let the ace be 1).


In [ ]:
deck[:13]

Now, based on what we've just discussed, drawing a hand of five cards from the deck could be written as:


In [ ]:
def draw(deck, n=5):
    return deck.take(np.random.permutation(len(deck))[:n])
draw(deck)

Suppose you wanted two random cards from each suit. Because the suit is the last character of each card name, we can group based on this and use apply:


In [ ]:
get_suit = lambda card: card[-1] # last letter is suit
deck.groupby(get_suit).apply(draw, n=2)

In [ ]:
# alternatively
deck.groupby(get_suit, group_keys=False).apply(draw, n=2)

Example: Group weighted average and correlation

Under the split-apply-combine paradigm of groupby, operations between columns in a DataFrame or between two Series, such as a group weighted average, become a routine affair. As an example, take this dataset containing group keys, values, and some weights:


In [ ]:
df = DataFrame({'category': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
                'data': np.random.randn(8),
                'weights': np.random.rand(8)})
df

The group weighted average by category would then be:


In [ ]:
grouped = df.groupby('category')
get_wavg = lambda g: np.average(g['data'], weights=g['weights'])
grouped.apply(get_wavg)

As a less trivial example, consider a data set from Yahoo! Finance containing end of day prices for a few stocks and the S&P 500 index (the SPX ticker):


In [ ]:
close_px = pd.read_csv('stock_px.csv', parse_dates=True, index_col=0)
close_px.info()

One task of interest might be to compute a DataFrame consisting of the yearly correlations of daily returns (computed from percent changes) with SPX. Here is one way to do it:


In [ ]:
close_px[-4:]

In [ ]:
rets = close_px.pct_change().dropna()          # daily returns
spx_corr = lambda x: x.corrwith(x['SPX'])      # correlate each column with SPX
by_year = rets.groupby(lambda x: x.year)       # group by the year of each date
by_year.apply(spx_corr)

There is of course nothing to stop you from computing inter-column correlation:


In [ ]:
# Annual correlation of Apple with Microsoft
by_year.apply(lambda g: g['AAPL'].corr(g['MSFT']))

Pivot tables and Cross-tabulation

A pivot table is a data summarization tool. It aggregates a table of data by one or more keys, arranging the data rectangularly with some of the group keys along the rows and some along the columns. In pandas, pivot tables are built on top of the groupby machinery: DataFrames have a pivot_table method, and there is also a top-level pandas.pivot_table function. Besides providing a convenient interface to groupby, pivot_table can add partial totals, also known as margins.

We can use pivot_table to calculate the group means of the tips data by sex and smoker (the mean is the default aggregation):


In [ ]:
tips = pd.read_csv('tips.csv')
tips['tip_pct'] = tips['tip'] / tips['total_bill']

tips.pivot_table(index=['sex', 'smoker'])
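
This default call (group means arranged by sex and smoker) could equally be produced with groupby directly; pivot_table is essentially a convenient front end to the same machinery.


In [ ]:
# Equivalent groupby formulation of the default pivot_table call above
tips.groupby(['sex', 'smoker']).mean()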

Suppose that now we only care about the tip percentage, the party size, and the day of the week. We can put smoker in the columns and sex and day in the rows, yielding a table of the group averages of tip_pct and size:


In [ ]:
tips.pivot_table(['tip_pct','size'], index=['sex', 'day'],
                 columns='smoker')

Furthermore, if we also want the overall tipping percentages and party sizes without conditioning on smoker status, we can use the margins argument. Passing margins=True adds All rows and columns containing the corresponding group statistics computed without regard to the other keys.


In [ ]:
tips.pivot_table(['tip_pct', 'size'], index=['sex', 'day'],
                 columns='smoker', margins=True)

The All entries show the average tipping percentage and party size without regard to smoking. To use a different aggregation function, pass it via the aggfunc argument. For example, we can use len to produce a cross-tabulation of group counts:


In [ ]:
tips.pivot_table('tip_pct', index=['sex', 'smoker'], columns='day',
                 aggfunc=len, margins=True)

To replace missing combinations (NA values) with zero, we can use the fill_value argument.


In [ ]:
tips.pivot_table('size', index=['time', 'sex', 'smoker'],
                 columns='day', aggfunc='sum',fill_value=0)

Cross-tabulations: crosstab

A cross-tabulation is a special case of a pivot table that computes group frequencies.


In [ ]:
from io import StringIO
data = """\
Sample    Gender    Handedness
1    Female    Right-handed
2    Male    Left-handed
3    Female    Right-handed
4    Male    Right-handed
5    Male    Left-handed
6    Male    Right-handed
7    Female    Right-handed
8    Female    Left-handed
9    Male    Right-handed
10    Female    Right-handed"""
data = pd.read_table(StringIO(data), sep=r'\s+')

data

We could use pivot_table to do this calculation (a sketch follows the next cell), but pandas.crosstab is more convenient:


In [ ]:
pd.crosstab(data.Gender, data.Handedness, margins=True)
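
For comparison, here is one way the same frequency table could be built with pivot_table; crosstab is simply less typing.


In [ ]:
# pivot_table sketch of the crosstab above: count samples per cell
data.pivot_table('Sample', index='Gender', columns='Handedness',
                 aggfunc=len, margins=True)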

The first two arguments to crosstab can each be an array, a Series, or a list of arrays.


In [ ]:
pd.crosstab([tips.time, tips.day], tips.smoker, margins=True)

Example: 2012 Federal Election Commission Database

We will be working with data from the 2012 US Presidential Election. This dataset focuses on campaign contributions for presidential candidates. The data can be loaded from:


In [ ]:
fec = pd.read_csv('P00000001-ALL.csv')

fec.info()

A sample record from the DataFrame looks like this:


In [ ]:
fec.iloc[123456]

One interesting aspect of this data set is that it contains no indication of each candidate's political party. We can add this information ourselves by creating a dictionary that maps each candidate to a party. First, we need to find out who all of the candidates are.


In [ ]:
unique_cands = fec.cand_nm.unique()
unique_cands

In [ ]:
unique_cands[2]

We now define a dict, parties, giving the party affiliation of each candidate.


In [ ]:
parties = {'Bachmann, Michelle': 'Republican',
           'Cain, Herman': 'Republican',
           'Gingrich, Newt': 'Republican',
           'Huntsman, Jon': 'Republican',
           'Johnson, Gary Earl': 'Republican',
           'McCotter, Thaddeus G': 'Republican',
           'Obama, Barack': 'Democrat',
           'Paul, Ron': 'Republican',
           'Pawlenty, Timothy': 'Republican',
           'Perry, Rick': 'Republican',
           "Roemer, Charles E. 'Buddy' III": 'Republican',
           'Romney, Mitt': 'Republican',
           'Santorum, Rick': 'Republican'}

We can test the dictionary by taking a slice of the candidate names and mapping them to their political affiliations.


In [ ]:
fec.cand_nm[123456:123461]

In [ ]:
fec.cand_nm[123456:123461].map(parties)

To calculate the number of contributions to each party, we add the party as a column and use value_counts to count the contributions per party.


In [ ]:
# Add it as a column

fec['party'] = fec.cand_nm.map(parties)

fec['party'].value_counts()

Unfortunately, this counts both positive and negative contributions to candidates' campaigns (negative values indicate refunds). To see how many of each there are before restricting to donations, we can count the sign of the receipt amounts.


In [ ]:
(fec.contb_receipt_amt > 0).value_counts()

To keep only positive contributions, and to restrict attention to the two main candidates, we use the following code.


In [ ]:
fec = fec[fec.contb_receipt_amt > 0]

fec_mrbo = fec[fec.cand_nm.isin(['Obama, Barack', 'Romney, Mitt'])]

Donation statistics by occupation and employer

One interesting question is the occupation of donors for each party. For example, do lawyers donate more to Democrats or Republicans? To which party do business executives donate more money?


In [ ]:
fec.contbr_occupation.value_counts()[:10]

We can again use dictionaries, this time to clean up the occupation and employer fields by collapsing variants of the same value into one form.


In [ ]:
occ_mapping = {
   'INFORMATION REQUESTED PER BEST EFFORTS' : 'NOT PROVIDED',
   'INFORMATION REQUESTED' : 'NOT PROVIDED',
   'INFORMATION REQUESTED (BEST EFFORTS)' : 'NOT PROVIDED',
   'C.E.O.': 'CEO'
}

In [ ]:
# If no mapping provided, return x
f = lambda x: occ_mapping.get(x, x)
fec.contbr_occupation = fec.contbr_occupation.map(f)

emp_mapping = {
   'INFORMATION REQUESTED PER BEST EFFORTS' : 'NOT PROVIDED',
   'INFORMATION REQUESTED' : 'NOT PROVIDED',
   'SELF' : 'SELF-EMPLOYED',
   'SELF EMPLOYED' : 'SELF-EMPLOYED',
}

In [ ]:
# If no mapping provided, return x
f = lambda x: emp_mapping.get(x, x)
fec.contbr_employer = fec.contbr_employer.map(f)

Using pivot_table, we can aggregate donation amounts by party and occupation, then filter down to occupations whose total donations exceeded $2 million.


In [ ]:
by_occupation = fec.pivot_table('contb_receipt_amt',
                                index='contbr_occupation',
                                columns='party', aggfunc='sum')

over_2mm = by_occupation[by_occupation.sum(1) > 2000000]
over_2mm

In [ ]:
over_2mm.plot(kind='barh')

Alternatively, we can look at the top donor occupations and employers for Barack Obama and Mitt Romney. We do this by grouping by candidate name and using a variant of the top function from earlier.


In [ ]:
def get_top_amounts(group, key, n=5):
    totals = group.groupby(key)['contb_receipt_amt'].sum()

    # Order the totals in descending order and keep the top n
    return totals.nlargest(n)

grouped = fec_mrbo.groupby('cand_nm')
grouped.apply(get_top_amounts, 'contbr_occupation', n=7)

In [ ]:
grouped.apply(get_top_amounts, 'contbr_employer', n=10)

Bucketing donation amounts

A useful way to analyze this data is to use the cut function to discretize the contribution amounts into buckets by size.


In [ ]:
bins = np.array([0, 1, 10, 100, 1000, 10000, 100000, 1000000, 10000000])
labels = pd.cut(fec_mrbo.contb_receipt_amt, bins)
labels

Grouping the data by name and bin, we get a histogram by donation size.


In [ ]:
grouped = fec_mrbo.groupby(['cand_nm', labels])
grouped.size().unstack(0)

The data shows that Barack Obama received significantly more contributions of smaller donation sizes. We can also sum the contribution amounts and normalize the data to view a percentage of total donations of each size by candidate:


In [ ]:
bucket_sums = grouped.contb_receipt_amt.sum().unstack(0)
bucket_sums

In [ ]:
normed_sums = bucket_sums.div(bucket_sums.sum(axis=1), axis=0)
normed_sums

In [ ]:
normed_sums[:-2].plot(kind='barh', stacked=True)

Donation statistics by state

We can also aggregate donations by candidate and state:


In [ ]:
grouped = fec_mrbo.groupby(['cand_nm', 'contbr_st'])
totals = grouped.contb_receipt_amt.sum().unstack(0).fillna(0)
totals = totals[totals.sum(1) > 100000]
totals[:10]

Additionally, we may obtain the relative percentage of total donations by state for each candidate.


In [ ]:
percent = totals.div(totals.sum(1), axis=0)
percent[:10]